We were given a data set from the National Health and Nutrition Examination Survey (NHANES). According to CDC (2023), the program has been conducted since the 1960s as a series of surveys designed to assess the health and nutritional status of adults and children in the United States. It collects data through in-person interviews and physical examinations of participants.
The survey data is not a simple random sample, however. According to CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (G et al. 2013), the sampling strategy consists of four stages: (1) selection of counties as primary sampling units (PSUs); (2) selection of segments within PSUs, each constituting a block of households; (3) selection of specific households within segments; and (4) selection of individuals within a household.
We aim to study the relationship between the weight variable and the other health-related variables in the data.
We began our study with an exploratory analysis of the variables using various tables and charts. We then performed several hypothesis tests on some of the variables. Lastly, we fit a linear regression model for the response variable “weight” on the other variables and confounders.
We began our analysis by giving a data dictionary, shown in Table 1 below. Some variables have a relatively high percentage of missing values. In Part 2 we ran hypothesis tests to decide whether some of these variables could be excluded from the regression analysis in Part 3.
Weight is a continuous variable in our data. A simple way of categorizing it is through the BMI indicator, and the data already contains an obese variable built this way: it applies a threshold of 35 to the BMI value, so that a person is classified as not obese if the BMI is below 35 and as obese otherwise. We therefore used the obese variable as the categorical version of weight in our project.
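As a minimal sketch of this categorization, the snippet below computes BMI from weight and standing height and then applies the threshold of 35. The function names `bmi` and `obese_flag` are hypothetical and are not part of the data set; the formula is the standard kg/m² definition, and treating the cut-off as strictly greater than 35 follows the comment in the data dictionary.

```python
OBESE_THRESHOLD = 35.0  # cut-off taken from the data dictionary entry "BMI>35"


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and standing height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def obese_flag(bmi_value):
    """Return "Yes"/"No" for the obese indicator, or None when BMI is missing."""
    if bmi_value is None:
        return None
    return "Yes" if bmi_value > OBESE_THRESHOLD else "No"


if __name__ == "__main__":
    # Example values copied from the wt and ht rows of the data dictionary.
    for wt, ht in [(87.400002, 164.7), (116.8, 166.0)]:
        value = bmi(wt, ht)
        print(round(value, 2), obese_flag(value))
```

The two example rows correspond to the first and third example values listed for wt, ht and bmi in the data dictionary.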
| Variables | Type | Example | Number.Unique | MissingPct | Comment |
|---|---|---|---|---|---|
| id | integer | 1, 2, 3 | 6482 | 0% | Identification Code (1 - 6482) |
| gender | factor | Male, Female | 2 | 0% | Gender (1: Male, 2: Female) |
| age | integer | 34, 16, 60 | 65 | 0% | Age (Years) |
| marstat | factor | Married, NA, Widowed | 6 | 9.7% | Marital Status (1: Married, 2: Widowed, 3: Divorced, 4: Separated, 5: Never Married, 6: Living Together) |
| samplewt | numeric | 80100.544, 13953.078, 20090.339 | 2499 | 0% | Statistical Weight (4084.478 - 153810.3) |
| psu | integer | 1, 2 | 2 | 0% | Pseudo-PSU (1, 2) |
| strata | integer | 9, 10, 1 | 15 | 0% | Pseudo-Stratum (1 - 15) |
| tchol | integer | 135, 192, 202 | 251 | 6.09% | Total Cholesterol (mg/dL) |
| hdl | integer | 50, 60, 45 | 112 | 6.09% | HDL-Cholesterol (mg/dL) |
| sysbp | integer | 114, 112, 154 | 61 | 8.53% | Systolic Blood Pressure (mm Hg) |
| dbp | integer | 88, 62, 70 | 40 | 9.16% | Diastolic Blood Pressure (mm Hg) |
| wt | numeric | 87.400002, 72.300003, 116.8 | 957 | 0.57% | Weight (kg) |
| ht | numeric | 164.7, 181.3, 166 | 527 | 0.57% | Standing Height (cm) |
| bmi | numeric | 32.22, 22, 42.39 | 2276 | 0.57% | Body mass Index (Kg/m^2) |
| vigwrk | factor | No, Yes, NA | 2 | 0.02% | Vigorous Work Activity (1: Yes, 2: No) |
| modwrk | factor | No, Yes, NA | 2 | 0.02% | Moderate Work Activity (1: Yes, 2: No) |
| wlkbik | factor | No, Yes, NA | 2 | 0.02% | Walk or Bicycle (1: Yes, 2: No) |
| vigrecexr | factor | No, Yes, NA | 2 | 0.02% | Vigorous Recreational Activities (1: Yes, 2: No) |
| modrecexr | factor | No, Yes, NA | 2 | 0.03% | Moderate Recreational Activities (1: Yes, 2: No) |
| sedmin | integer | 480, 240, 720 | 37 | 1.22% | Minutes of Sedentary Activity per Week (0 - 840) |
| obese | factor | No, Yes, NA | 2 | 0.57% | BMI>35 (1: No, 2: Yes) |
According to CDC’s classification of body weight, a BMI below 18.5 is Underweight, a BMI between 18.5 and 24.9 is Healthy weight, a BMI between 25 and 29.9 is Overweight, and a BMI of 30 or above is Obesity. We adopted these categories and found a slight positive relationship between body weight and total cholesterol level, but a negative relationship between HDL and body weight. Because total cholesterol is made up largely of HDL and LDL, this suggests that the obese population tends to have a high level of LDL and a low level of HDL.
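The CDC weight classes can be written as a small lookup, sketched below. The function name `bmi_category` is hypothetical. The use of half-open intervals (for example, Healthy weight as 18.5 up to but not including 25) is an assumption made so that values such as 24.95 still fall into exactly one class.

```python
def bmi_category(bmi_value):
    """Map a BMI value (kg/m^2) to a CDC weight class.

    Boundaries are treated as half-open intervals, which is an assumption
    made here so that every BMI value belongs to exactly one class.
    """
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25.0:
        return "Healthy weight"
    if bmi_value < 30.0:
        return "Overweight"
    return "Obesity"


if __name__ == "__main__":
    # Example BMI values copied from the bmi row of the data dictionary.
    for value in (32.22, 22.0, 42.39):
        print(value, bmi_category(value))
```

Note that this grouping is coarser than the obese indicator in the data: a BMI between 30 and 35 is classed as Obesity here while the obese variable is still "No".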
According to ATPIII (n.d.), we can also categorize the cholesterol level.
| Marital Status | Obese: No | Obese: Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |
Let X be the categorical random variable for marital status and Y the one for obesity, and assume a random sample of n trials. Define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the k-th trial. Then the random vector \([N_{11}, ..., N_{IJ}]\) has a multinomial distribution with n trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}\) and \(p_{+j}\) for the marginal probabilities of X and Y, our hypothesis test is therefore:
\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \text{for all } i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]
A chi-squared test of independence gives a p-value of 0.6894, so there is not enough evidence to reject the null hypothesis. In other words, we cannot conclude that there is a relationship between obesity and marital status.
We ran the same test for the other categorical variables against obesity. From Table 2 we can see that independence is rejected for the wlkbik, vigrecexr and modrecexr variables, but not for vigwrk and modwrk.
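To make the test concrete, the following sketch computes Pearson's chi-squared statistic from the marital status by obesity counts in the table above, using only the Python standard library. It is an illustration and not the code used for the report. The helper names are hypothetical, and the p-value uses a closed-form expression that is valid only for 5 degrees of freedom, which is the case for a 6 by 2 table.

```python
import math

# Observed counts (obese: No, Yes) copied from the marital status table.
observed = {
    "Married": (2530, 474),
    "Widowed": (418, 86),
    "Divorced": (528, 112),
    "Separated": (158, 35),
    "Never Married": (863, 160),
    "Living Together": (388, 66),
}


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for Pearson's test of independence."""
    rows = list(table.values())
    n_cols = len(rows[0])
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for j in range(n_cols):
            # Expected count under H0: n * p_{i+} * p_{+j}.
            expected = row_total * col_totals[j] / n
            stat += (row[j] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (n_cols - 1)
    return stat, dof


def chi2_upper_tail_df5(x):
    """Upper-tail probability of a chi-squared variable with exactly 5 df."""
    tail = math.erfc(math.sqrt(x / 2.0))
    tail += math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * (math.sqrt(x) + x ** 1.5 / 3.0)
    return tail


stat, dof = chi_square_independence(observed)
p_value = chi2_upper_tail_df5(stat)
print(f"statistic={stat:.3f}, df={dof}, p-value={p_value:.4f}")
```

On these counts the statistic is small relative to its 5 degrees of freedom, which is consistent with the large p-value reported above. A statistics library routine would normally replace the hand-written tail probability.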
| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.5695 | 0.3037 | 1.064e-07 | 4.061e-15 | 2.573e-09 |
| | Healthy weight (N=1883) | Obesity (N=2311) | Overweight (N=2127) | Underweight (N=124) | Overall (N=6445) |
|---|---|---|---|---|---|
| Gender | |||||
| Male | 897 (47.6%) | 1036 (44.8%) | 1171 (55.1%) | 40 (32.3%) | 3144 (48.8%) |
| Female | 986 (52.4%) | 1275 (55.2%) | 956 (44.9%) | 84 (67.7%) | 3301 (51.2%) |
| Age (years) | |||||
| Mean (SD) | 41.2 (20.6) | 48.7 (17.7) | 48.9 (19.0) | 37.9 (21.0) | 46.4 (19.4) |
| Median [Min, Max] | 37.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 30.0 [16.0, 80.0] | 46.0 [16.0, 80.0] |
| Marital Status | |||||
| Married | 741 (39.4%) | 1158 (50.1%) | 1074 (50.5%) | 31 (25.0%) | 3004 (46.6%) |
| Widowed | 121 (6.4%) | 185 (8.0%) | 190 (8.9%) | 8 (6.5%) | 504 (7.8%) |
| Divorced | 154 (8.2%) | 262 (11.3%) | 210 (9.9%) | 14 (11.3%) | 640 (9.9%) |
| Separated | 47 (2.5%) | 82 (3.5%) | 63 (3.0%) | 1 (0.8%) | 193 (3.0%) |
| Never Married | 351 (18.6%) | 353 (15.3%) | 289 (13.6%) | 30 (24.2%) | 1023 (15.9%) |
| Living Together | 141 (7.5%) | 148 (6.4%) | 157 (7.4%) | 8 (6.5%) | 454 (7.0%) |
| Missing | 328 (17.4%) | 123 (5.3%) | 144 (6.8%) | 32 (25.8%) | 627 (9.7%) |
| Statistical Weight | |||||
| Mean (SD) | 36700 (26000) | 33000 (25100) | 34200 (26300) | 37400 (27800) | 34600 (25800) |
| Median [Min, Max] | 26100 [5050, 154000] | 23200 [4080, 124000] | 23600 [4450, 141000] | 26500 [6840, 113000] | 24200 [4080, 154000] |
| Pseudo-PSU | |||||
| Mean (SD) | 1.51 (0.500) | 1.50 (0.500) | 1.51 (0.500) | 1.50 (0.502) | 1.51 (0.500) |
| Median [Min, Max] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 1.50 [1.00, 2.00] | 2.00 [1.00, 2.00] |
| Pseudo-stratum | |||||
| Mean (SD) | 7.11 (4.09) | 7.36 (4.13) | 7.15 (4.16) | 7.80 (4.14) | 7.22 (4.13) |
| Median [Min, Max] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 8.00 [1.00, 15.0] | 7.00 [1.00, 15.0] |
| Total Cholesterol (mg/dL) | |||||
| Mean (SD) | 185 (39.9) | 194 (40.5) | 198 (42.8) | 172 (33.4) | 192 (41.4) |
| Median [Min, Max] | 180 [92.0, 383] | 191 [92.0, 357] | 194 [90.0, 380] | 166 [108, 289] | 189 [90.0, 383] |
| Missing | 123 (6.5%) | 142 (6.1%) | 121 (5.7%) | 6 (4.8%) | 392 (6.1%) |
| HDL-Cholesterol (mg/dL) | |||||
| Mean (SD) | 58.4 (17.1) | 47.6 (13.7) | 51.8 (15.5) | 63.3 (17.1) | 52.5 (16.0) |
| Median [Min, Max] | 56.0 [11.0, 144] | 46.0 [15.0, 115] | 50.0 [16.0, 119] | 63.0 [26.0, 114] | 50.0 [11.0, 144] |
| Missing | 124 (6.6%) | 142 (6.1%) | 120 (5.6%) | 6 (4.8%) | 392 (6.1%) |
| Systolic Blood Pressure (mm Hg) | |||||
| Mean (SD) | 119 (18.5) | 125 (17.3) | 125 (18.5) | 111 (18.5) | 123 (18.3) |
| Median [Min, Max] | 116 [90.0, 220] | 124 [90.0, 200] | 122 [90.0, 208] | 106 [90.0, 220] | 120 [90.0, 220] |
| Missing | 164 (8.7%) | 206 (8.9%) | 154 (7.2%) | 20 (16.1%) | 544 (8.4%) |
| Diastolic Blood Pressure (mm Hg) | |||||
| Mean (SD) | 67.4 (11.2) | 71.3 (12.4) | 69.8 (11.8) | 65.7 (11.3) | 69.6 (11.9) |
| Median [Min, Max] | 68.0 [40.0, 118] | 72.0 [40.0, 134] | 70.0 [40.0, 118] | 66.0 [44.0, 110] | 70.0 [40.0, 134] |
| Missing | 167 (8.9%) | 230 (10.0%) | 170 (8.0%) | 18 (14.5%) | 585 (9.1%) |
| Weight (Kg) | |||||
| Mean (SD) | 63.1 (9.13) | 99.0 (17.7) | 77.3 (10.3) | 47.9 (5.56) | 80.4 (20.2) |
| Median [Min, Max] | 62.8 [38.5, 95.5] | 96.9 [57.8, 159] | 76.8 [45.5, 117] | 47.7 [33.2, 63.0] | 77.6 [33.2, 159] |
| Standing Height (cm) | |||||
| Mean (SD) | 168 (10.0) | 167 (10.4) | 168 (10.4) | 166 (7.59) | 167 (10.2) |
| Median [Min, Max] | 167 [140, 203] | 166 [135, 196] | 168 [123, 202] | 165 [147, 186] | 167 [123, 203] |
| Vigorous Work Activity | |||||
| Yes | 324 (17.2%) | 418 (18.1%) | 371 (17.4%) | 16 (12.9%) | 1129 (17.5%) |
| No | 1558 (82.7%) | 1893 (81.9%) | 1756 (82.6%) | 108 (87.1%) | 5315 (82.5%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Work Activity | |||||
| Yes | 651 (34.6%) | 796 (34.4%) | 701 (33.0%) | 32 (25.8%) | 2180 (33.8%) |
| No | 1231 (65.4%) | 1515 (65.6%) | 1426 (67.0%) | 92 (74.2%) | 4264 (66.2%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Walk or Bicycle | |||||
| Yes | 630 (33.5%) | 549 (23.8%) | 573 (26.9%) | 48 (38.7%) | 1800 (27.9%) |
| No | 1252 (66.5%) | 1762 (76.2%) | 1554 (73.1%) | 76 (61.3%) | 4644 (72.1%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Vigorous Recreational Activities | |||||
| Yes | 579 (30.7%) | 344 (14.9%) | 449 (21.1%) | 27 (21.8%) | 1399 (21.7%) |
| No | 1303 (69.2%) | 1967 (85.1%) | 1678 (78.9%) | 97 (78.2%) | 5045 (78.3%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Recreational Activities | |||||
| Yes | 834 (44.3%) | 791 (34.2%) | 823 (38.7%) | 37 (29.8%) | 2485 (38.6%) |
| No | 1048 (55.7%) | 1520 (65.8%) | 1303 (61.3%) | 87 (70.2%) | 3958 (61.4%) |
| Missing | 1 (0.1%) | 0 (0%) | 1 (0.0%) | 0 (0%) | 2 (0.0%) |
| Minutes of Sedentary Activity per Week (mins) | |||||
| Mean (SD) | 316 (185) | 333 (186) | 308 (184) | 366 (195) | 321 (186) |
| Median [Min, Max] | 300 [0, 840] | 300 [0, 840] | 300 [1.00, 840] | 300 [10.0, 840] | 300 [0, 840] |
| Missing | 17 (0.9%) | 34 (1.5%) | 26 (1.2%) | 1 (0.8%) | 78 (1.2%) |
| Obese | |||||
| No | 1883 (100%) | 1325 (57.3%) | 2127 (100%) | 124 (100%) | 5459 (84.7%) |
| Yes | 0 (0%) | 986 (42.7%) | 0 (0%) | 0 (0%) | 986 (15.3%) |
From Table 1, we can see that our data set has 6445 observations of 21 variables, 8 of which can be treated as categorical. The original data set contains 6482 observations, 37 of which are missing the bmi variable. Because these account for less than 0.6% of the total, we deleted them and used the BMI level to stratify the remaining observations. The data set mainly covers participants between 16 and 80 years old. Among them, the average weight of males is greater than that of females at all ages, and the line chart shows that average weight changes with age in the same way for both genders: a sustained increase, followed by fluctuation and finally a continuous decrease.
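The arithmetic behind dropping the incomplete rows can be checked with a short sketch. The helper names `missing_pct` and `drop_missing` and the three toy records are hypothetical; only the counts 6482 and 37 come from the text.

```python
def missing_pct(n_missing, n_total):
    """Percentage of observations with a missing value."""
    return 100.0 * n_missing / n_total


def drop_missing(rows, key):
    """Keep only the rows (dicts) whose value for `key` is not None."""
    return [row for row in rows if row.get(key) is not None]


# Counts reported in the text: 6482 original rows, 37 without a BMI value.
n_total, n_missing = 6482, 37
print(f"{missing_pct(n_missing, n_total):.2f}% missing, "
      f"{n_total - n_missing} rows kept")

# Toy records (made up for illustration) showing the complete-case filter.
toy_rows = [{"id": 1, "bmi": 32.22}, {"id": 2, "bmi": None}, {"id": 3, "bmi": 42.39}]
complete_rows = drop_missing(toy_rows, "bmi")
```

This is a complete-case approach: it is simple and loses little information here, but it does assume that the missing BMI values are not systematically different from the observed ones.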
We are also interested in the relationship between weight and marital status. The following box plot shows that the average weight in each marital status group is around 80 kg, with the widowed group having the lowest average weight among the six categories.
To better understand the relationship between weight and the other health-related variables, we can categorize the weight level of every observation by BMI.